1 Problem 4: Predictive model building: California housing

Our goal is to build the best predictive model for the median house value in California. Following the exercise description, we first standardize totalRooms and totalBedrooms, creating the new variables sdrooms and sdbedrooms. We then fit linear regression models (with and without stepwise variable selection) and tree-based models (CART, random forest, and boosting) as candidates for the best predictive model.
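The standardization step can be sketched as follows. The report does not say which transformation was used, so this minimal sketch assumes a z-score (subtract the mean, divide by the standard deviation); if the totals were instead scaled by the number of households per tract, only the function body would change. The function name `zscore` and the sample values are hypothetical and serve only as illustration.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to mean 0 and standard deviation 1 (assumed transformation)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


# Hypothetical tract-level totals, not taken from the actual data set
total_rooms = [880.0, 7099.0, 1467.0, 1274.0]
total_bedrooms = [129.0, 1106.0, 190.0, 235.0]

sdrooms = zscore(total_rooms)
sdbedrooms = zscore(total_bedrooms)
```

After this step both new variables are on the same scale, so their regression coefficients can be compared directly.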

We compare the models by their RMSEs to find out which one predicts best.

Table 1.1: The RMSEs of Models
OLS        Stepwise   CART       Random Forest   Boosting
72506.02   69213.52   61432.45   51231.49        72836.24

Table 1.1 shows that the random forest model has the lowest RMSE (51231.49), so we choose it as our best predictive model.
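The comparison above can be sketched in a few lines. RMSE is the square root of the mean squared difference between actual and predicted values, and the preferred model is the one with the smallest RMSE. The RMSE values below are copied from Table 1.1; the helper `rmse` and the variable names are illustrative assumptions, not the code used to produce the table.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# RMSEs as reported in Table 1.1
rmse_by_model = {
    "OLS": 72506.02,
    "Stepwise": 69213.52,
    "CART": 61432.45,
    "Random Forest": 51231.49,
    "Boosting": 72836.24,
}

# Pick the model whose RMSE is smallest
best_model = min(rmse_by_model, key=rmse_by_model.get)
print(best_model, rmse_by_model[best_model])
```

Because RMSE is measured in the units of the response, the values in Table 1.1 can be read as typical prediction errors in dollars of median house value.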

Figure 1.1: Actual Median House Value in California

Figure 1.2: Prediction Value in California

Figure 1.3: Model’s error

The plots above agree well: Figure 1.1 closely matches Figure 1.2, which suggests that the random forest model predicts well. In conclusion, our predictive model is effective, so we may use it to forecast house values in other states or areas of the United States.
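The error shown in Figure 1.3 can be thought of as the difference between each actual value and its prediction. The sketch below assumes the sign convention actual minus predicted, which the report does not state, and uses hypothetical values rather than the real data.

```python
def residuals(actual, predicted):
    """Per-observation error, assuming the convention actual minus predicted."""
    return [a - p for a, p in zip(actual, predicted)]


# Hypothetical median house values and predictions, for illustration only
actual = [452600.0, 358500.0, 352100.0]
predicted = [430000.0, 365000.0, 340000.0]

errors = residuals(actual, predicted)
mean_error = sum(errors) / len(errors)  # average signed error (bias)
```

Under this convention a positive error means the model underpredicts, and a mean error close to zero would indicate little systematic bias.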